Homework 10.1

First we need to load the data. For convenience, since the time does not reset between growth events, we will compute this manually so we have a regularized time. Last, we will make a quick plot of the first event to get some background information that we would presumably have given we were doing the experiment, such as the order of magnitude of Caulobacter size and lifetime.

We want to model the Caulobacter growth using two models: linear and exponential. The linear model is given by $a(t)=a_0+bt$, and the exponential by $a(t)=a_oe^{kt}$. For both models, we will use the same structure, where if $a(t)$ represents our model's growth function, we will model the area of the bacteria with the likelihood

$$a_i \sim \textrm{Norm}(a(t_i), \sigma).$$

We want to use a hierarchical model with the growth rates for each growth event for each bacteria being given by $b^m_n$ and $k^m_n$ being the rates for bacteria $m$ and growth event $n$. We will have the hyperparameters $b$ and $k$, with the first level of the hierarchy modeling the variability for each cell from the hyperpriors being given by $b^m \sim \textrm{Norm}(b,\sigma_b)$ and $k^m \sim \textrm{Norm}(k,\sigma_k)$, and the second level of the hierarchy modeling the variability for each growth event from the cell's growth rates given by $b^m_n \sim \textrm{Norm}(b^m,\sigma_b)$ and $k^m_n \sim \textrm{Norm}(k^m,\sigma_k)$, where we are using the same $\sigma_{b/k}$ for each level of the hierarchy.

This means that we need priors for $a_0$, $b$, $k$, $\sigma$, $\sigma_{b}$, and $\sigma_{k}$.

In each model, $a_0$ represents the initial size of the bacteria at the start of each growth event. A quick search finds that Caulobacter are about 0.7 µm across and 2-3 µm long, so we estimate the area to be about 1.75 µm². Since we want the initial size, which is likely a bit lower, we will guess a value of about 1.5 and take the prior $a_0 \sim \textrm{Norm}(1.5, 0.75)$. This will likely give us some negative initial sizes, but sampling the real data will deal with this, and we would rather be too broad than too narrow.

We know that the Caulobacter lifetime is on the order of an hour or two, and since it grows and then divides into two, we would expect $b$ to be about $1.25\,/90=0.01388$. If the lifetime is about 30 minutes and the bacteria quadruples in size, this gives a rough upper bound of $4\,/30=0.13333$, and if the bacteria doesn't grow, this gives a lower bound of $0$. Thus, we will take the prior $b \sim \textrm{HalfNorm}(0.05)$.

We can use similar reasoning to come up with a prior for $k$. We expect $\ln(r)/t_f$, where $r$ is the ratio of final to initial areas and $t_f$ is the time at cell division, to be an estimate for $k$. For $r=4$ and $t_f=90$, we get an upper bound of about 0.01540, so we will take the prior $k \sim \textrm{HalfNorm}(0.0075)$.

The parameter $\sigma$ will be same for all $a_i$ since we expect the bacteria growth to be similarly noisy across all times since the area starts and ends at similar values (i.e., it varies over less than an order of magnitude within a growth event). Given our quick estimate for the average area of Caulobacter and the rough estimate that we expect between 0 and 30% variability between measurements, we will take the prior $\sigma \sim \textrm{HalfNorm}(0.3)$.

For $\sigma_b$ and $\sigma_k$ we will use similar priors, namely $\sigma_b \sim \textrm{HalfNorm}(0.025)$ and $\sigma_k \sim \textrm{HalfNorm}(0.0035)$.

The full model is then

$$\begin{align} a_{i\,n}^{\,\,m} &\sim \textrm{Norm}(a^m_n(t_i), \sigma) \\[0.5em] a^m_n(t) = a_0+b^m_nt \;\; &\textrm{OR} \;\; a^m_n(t)=a_0e^{k^m_nt} \\[0.5em] b &\sim \textrm{HalfNorm}(0.05) \\[0.5em] k &\sim \textrm{HalfNorm}(0.0075) \\[0.5em] b^m &\sim \textrm{Norm}(b,\sigma_b) \\[0.5em] k^m &\sim \textrm{Norm}(k,\sigma_k) \\[0.5em] b^m_n &\sim \textrm{Norm}(b^m,\sigma_b) \\[0.5em] k^m_n &\sim \textrm{Norm}(k^m,\sigma_k) \\[0.5em] \sigma_b &\sim \textrm{HalfNorm}(0.025) \\[0.5em] \sigma_k &\sim \textrm{HalfNorm}(0.0035) \\[0.5em] \sigma &\sim \textrm{HalfNorm}(0.3) \\[0.5em] a_0 &\sim \textrm{Norm}(1.5, 0.75) \\[0.5em] \end{align}$$

Now we can do our prior predictive checks with Stan. Due to the size of the data set and to avoid plotting a huge number of points at once, we will only take 500 prior predictive samples, otherwise the browser has a very hard time.

To plot our data, we will select from every sample a random growth event for a random bacteria and plot the data for all of our selections on top of one another for each model.

These look pretty good. There are some negative values and some values that are certainly way too high, but most of our predictive data sets fall in a decent envelope of where we would expect them to be, and the parameter values will be refined through sampling. We would like to be able to do SBC, but given our computing and time constraints, we will not, and so we will move on to coding up the model and doing MCMC.

Our sampling looks okay. We are close to getting ESS of 400 for $b$ and $k$, and the parameters $\sigma_b$ and $\sigma_k$ are not hugely important. The Rhats are very close to 1.01 and are within the looser guideline of 1.1. We would get more samples if we had time as this would likely fix these issues, and we tried to run with 2000, but Jupyter Notebooks crashed at the very end and we lost our results, so we only had time to rerun with 1000.

Let's take a look at our posterior predictive checks. We will use the same strategy as with the prior predictive checks but only use every 20th sample out of 4000 so it's not too much to handle. We will also overlay the real data.

These look pretty good. The eye test says that the linear model might be slightly better since it seems a bit tighter to the data compared to the exponential for larger times.

Let's look at our parameter estimates and see what we can get out of them.

As we guessed from the plot, $\sigma$ for the linear model is smaller than $\sigma$ for the exponential. A quantity that we may be interested in looking at is the ratio of the rate parameter to the scale parameter for how that rate varies from trial to trial, i.e. $b\,/\sigma_b$ and $k\,/\sigma_k$.

It looks like our estimate for $b$ varies relatively less from trial to trial than $k$. We can take this as further indication that the linear model is a better fit to the data since we can get a more precise parameter estimate according to trial to trial variation. This means the model doesn't have to be as flexible to fit the data, which is likely indicative of a better model.

Let's try a model comparison using the LOO.

The LOO significantly prefers the linear model, as we thought it might, although not overwhelmingly so. Interestingly, the p_loo column contains the effective number of parameters, and this value is higher for the exponential model than for the linear. This is likely because of what we were just discussing above regarding the values of $b/\sigma_b$ and $k/\sigma_k$, where the greater uncoupling of the trials for the exponential model results in a greater number of effective parameters, which likely hurts the LOO and causes the linear model to be preferred.

From everything that we have done, it seems like the linear model for Caulobacter growth is a better model than the exponential model, in opposition to what the authors of the paper found. The median parameter estimates for this model are given below (intervals can be found above).

Of course, the big question hanging over everything is whether Caulobacter growth may actually be exponential, but on the time scale observed, the linear and exponential models are indistinguishable since an exponential is essentially linear for small $t$. We have a couple of things to say regarding this point.

First, there is some evidence that growth is indeed linear and not exponential from the larger fanning out in the posterior predictive check for the exponential than for the linear. This indicates that we are starting to get into the time domain where the non-linearity of the exponential is starting to kick in, and we can see that it does not quite match the data. Additionally, we know that on timescales, the exponential is essentially linear with $b=ka_0$. Let's look at this quantity versus our actual $b$ estimate.

We see that these values are not in fact equal since $b$ is about 142% of $ka_0$, which is significant. This is further evidence that we are reaching the time domain where the non-linearity matters, and we prefer the linear model for this experimental data.

Second, the question of whether or not growth is linear or exponential could be explored further in another experiment by somehow increasing the time to division, but this would add confounding factors that would make the results unreliable.

Finally, and somewhat related to the last point, is that the question of whether or not growth is linear or exponential on a larger timescale is not really a good question to be asking in the first place. This is because the only timescale that actually matters and that we are interested in is the timescale of the Caulobacter life cycle, and we have shown that the linear growth model is preferable in this domain.

This homework was contributed to by Danny, Happy, and David.